174 ◾ Bioinformatics
The HTSeq-count output file contains the feature counts for each sample, as shown in
Figure 5.3. The feature count file includes tab-delimited columns for gene symbols,
transcript IDs, and a count column for each sample. Note that some genes have zero
reads aligned to them. Later, we will filter out the genes that have no aligned reads and
those with low coverage.
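As a rough illustration of that filtering step, the following Python sketch splits genes by their total count across samples. It is not the pipeline used in this chapter: the table contents, the presence of a header row, the function name split_by_total_count, and the min_total threshold are all assumptions made for the example.

```python
import csv
import io

# Hypothetical three-sample count table in the layout described above:
# gene symbol, transcript ID, then one count column per sample.
# Gene names, IDs, counts, and the header row are made up for illustration.
COUNT_TABLE = (
    "gene\ttranscript\tsample1\tsample2\tsample3\n"
    "GENE_A\tTX_A\t120\t98\t143\n"
    "GENE_B\tTX_B\t0\t0\t0\n"
    "GENE_C\tTX_C\t3\t0\t1\n"
)


def split_by_total_count(handle, min_total=1):
    """Return (kept, dropped) gene lists based on each row's summed count."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the (assumed) header row
    kept, dropped = [], []
    for row in reader:
        gene = row[0]
        counts = [int(c) for c in row[2:]]  # columns after the two ID columns
        if sum(counts) >= min_total:
            kept.append(gene)
        else:
            dropped.append(gene)
    return kept, dropped


kept, dropped = split_by_total_count(io.StringIO(COUNT_TABLE))
print("kept:", kept)
print("dropped:", dropped)
```

With the default threshold only GENE_B, which has no aligned reads, is dropped. Raising min_total (for example to 10) would also drop the low-coverage GENE_C. The appropriate threshold depends on the experiment.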
5.3.5 Normalization
In general, when we analyze gene expression data, we may need to normalize it to avoid
biases that arise from differences in gene length, GC content, and library size (the total
number of reads aligned to all genes in a sample) [21]. Normalization of count data is
important for comparing the expression of genes within a sample and between
different samples. Normalizing by gene length corrects a bias that affects within-sample
comparison of gene expression. A longer gene has a higher chance of being sequenced
than a shorter gene. Consequently, a longer gene would have a higher number
of aligned reads than a shorter one at the same expression level in the same sample.
GC content also affects within-sample comparison of gene expression. GC-rich
and GC-poor fragments tend to be under-represented in RNA-Seq sequencing, and hence,
genes with a GC content closest to 40% have a higher chance of being sequenced
[22]. Library size affects the comparison between the expression of the same gene in
different samples (between-sample effect).
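The between-sample effect of library size can be sketched with a toy example. The gene names and counts below are hypothetical. The second sample is simply sequenced twice as deeply as the first, so every raw count doubles even though the relative expression of each gene is unchanged.

```python
# Hypothetical counts: gene -> reads aligned in [sample 1, sample 2].
counts = {
    "GENE_A": [100, 200],
    "GENE_B": [300, 600],
    "GENE_C": [600, 1200],
}

# Library size: total number of reads aligned to all genes in a sample.
library_sizes = [sum(column) for column in zip(*counts.values())]

# Fraction of each library that is taken up by GENE_A.
shares = [count / size for count, size in zip(counts["GENE_A"], library_sizes)]

print("library sizes:", library_sizes)
print("GENE_A raw counts:", counts["GENE_A"])
print("GENE_A share of library:", shares)
```

The raw counts of GENE_A differ by a factor of two between the samples, but its share of each library is identical. Comparing raw counts directly would therefore suggest an expression difference that is only due to sequencing depth.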
There are several normalization methods for adjusting for the biases resulting from the
causes mentioned above. Choosing the right normalization method depends on
whether the comparison is within-sample or between-sample. In the following, we will
discuss some of the normalization methods used by gene expression analysis programs
such as edgeR and DESeq2 [23].
5.3.5.1 RPKM and FPKM
RPKM [24] (reads per kilobase of transcript per million mapped reads) is a normalized
unit for the counts of reads aligned to genes (a normalized gene expression unit). It scales
the count data by gene length to adjust for the sequencing bias arising from differences
in gene length. RPKM is used for within-sample gene expression comparison (i.e.,
comparison between genes in the same sample).
Assume that $N$ reads are aligned to the reference sequence and that $k_i$ of them are
aligned to gene $i$ of length $l_i$ bp. The RPKM of gene $i$ is then calculated as
\[
\mathrm{RPKM}_i = \frac{k_i}{\left(\dfrac{l_i}{10^3}\right)\left(\dfrac{N}{10^6}\right)} \tag{5.1}
\]
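Formula 5.1 can be sketched in a few lines of Python. The function name rpkm, its parameter names, and the read counts and gene lengths used below are assumptions chosen for illustration. They are not taken from the chapter's data.

```python
def rpkm(reads_on_gene, gene_length_bp, total_mapped_reads):
    """Compute RPKM following Formula 5.1.

    reads_on_gene      -- reads aligned to the gene (k_i)
    gene_length_bp     -- gene length in bases (l_i)
    total_mapped_reads -- reads aligned to the reference sequence (N)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    length_kb = gene_length_bp / 1e3           # bases -> kilobases
    reads_millions = total_mapped_reads / 1e6  # reads -> millions of reads
    return reads_on_gene / (length_kb * reads_millions)


# Two hypothetical genes in one sample with 10 million mapped reads.
# The longer gene collects twice as many reads only because it is twice as long.
long_gene = rpkm(reads_on_gene=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
short_gene = rpkm(reads_on_gene=250, gene_length_bp=1000, total_mapped_reads=10_000_000)

print(long_gene, short_gene)
```

Both genes receive an RPKM of 25.0 even though their raw counts differ by a factor of two, which is the length correction that the within-sample comparison relies on.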
In the denominator of Formula 5.1, the length of the gene ($l_i$, in bases) is divided by
1,000 to express it in kilobases, and the total number of reads aligned to the reference
sequence ($N$) is divided by 1,000,000 to express it in millions. When the number of reads
aligned to a gene is divided by the